Problem description

In this project I analysed a dataset on deaths in Milan. For each day between 01/01/1980 and 31/12/1989, the dataset contains:

For each day we also have a set of explanatory variables:

The dataset was described in Vigotti, M.A., Rossi, G., Bisanti, L., Zanobetti, A. and Schwartz, J. (1996), "Short term effect of urban air pollution on respiratory health in Milan, Italy, 1980-1989", Journal of Epidemiology and Community Health, 50, 71-75, and analysed in Ruppert, D., Wand, M.P. and Carroll, R.J. (2003), Semiparametric Regression, Cambridge University Press.

The aim of the project is to build a predictive model for tot.mort and resp.mort and to understand which variables most affect the probability of death.

Explorative Data Analysis

Univariate descriptive analysis

Response variables

The following plots show the distributions of tot.mort and resp.mort. As we can see, the values of tot.mort are large enough to model it as a continuous variable, while the values of resp.mort are much smaller.

Milan population

The following plot shows the evolution of the Milan population according to the ISTAT censuses. The data can be found here. As we can see, the population decreased between 1980 and 1990. To take into account the decline in the number of people exposed to the risk, I computed the population day by day by linear interpolation between the census dates and used it as an exposure in the models. With this information I computed the mortality rate as: \[ \text{Mortality Rate} = \frac{tot.mort}{population} \] Note that this is a crude mortality rate, not an age-specific one: we are not considering the age distribution of the population. Given that the Milan population aged over the period 1980-1990, the crude mortality rate would have increased even if the mortality at each age had not changed.
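The interpolation step can be sketched as follows; the census dates and population counts below are placeholders, not the real ISTAT figures used in the analysis:

```r
# Hypothetical sketch: census dates and counts are made-up placeholders.
census_dates <- as.Date(c("1981-10-25", "1991-10-20"))  # assumed census days
census_pop   <- c(1605000, 1370000)                     # placeholder counts

days <- seq(as.Date("1980-01-01"), as.Date("1989-12-31"), by = "day")

# Linear interpolation between the census points; rule = 2 keeps the value
# constant outside the census range (days before the first census).
pop <- approx(x = as.numeric(census_dates), y = census_pop,
              xout = as.numeric(days), rule = 2)$y

# Crude daily mortality rate (placeholder death counts):
tot_mort  <- rep(50, length(days))
mort_rate <- tot_mort / pop
```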

Explanatory variables

The following plots show the distributions of the explanatory variables. As we can see, SO2 is strongly right-skewed. To deal with this, I applied a logarithmic transformation:

SO2_log = log(SO2 + 30)
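An illustrative sketch on simulated data (not the real SO2 series) of how the transform reduces the skewness; the `skew` helper is a simple moment-based estimate:

```r
# Simulated stand-in for the skewed SO2 series (not the real data).
set.seed(1)
so2 <- rlnorm(1000, meanlog = 3, sdlog = 0.8)

# The transform used in the report: log with a +30 offset, which keeps the
# argument of the logarithm well away from zero.
so2_log <- log(so2 + 30)

# Simple moment-based skewness estimate.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(so2)      # strongly positive
skew(so2_log)  # much closer to zero
```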

Multivariate descriptive analysis

Seasonal effect

Mortality

In the following plots we can see the trend of the mortality rate and the respiratory mortality rate through time. As we can see, there is a strong seasonal effect, with higher mortality in winter and lower mortality in summer. We can spot some outliers in summer 1983 and in the first quarter of 1986: in those periods the mortality was much higher than that registered in the same periods of other years.

In the following plot we can compare the within-year mortality pattern across years. As we can see, the patterns are quite similar.

Explanatory variables

In the following plots we can see the trend of the explanatory variables through time. As we can see, all these variables depend strongly on time. That means that a strong correlation between these variables and the mortality could be spurious. For example, during winter:

  • people work more, so they are more stressed and accidents at work are more frequent;
  • daylight time is shorter, which has an impact on health;
  • there is more traffic, so car accidents are more frequent.

To deal with this problem, I included the period of the year in the model through a spline on the day of the year.

Observing the mean.temp trend, we can see that summer 1983 was hotter than the other summers. That could partially explain the higher mortality in that period.

Relationships between variables

In the following plot we can see the relationships between the variables. As we can see, many of them are not linear, therefore a GLM could perform poorly and a GAM could be more suitable. In particular, from the plot of mean temperature against mortality rate (mean.temp, tot_mort_prob), we can see that the mortality rate is higher when it is cold and decreases with warmer temperatures, but it rises rapidly at very high temperatures.

Statistical Models

Without considering time

The best predictive model I found for the mortality rate, without taking time into account, is the following:

\[ \begin{align} \frac{Y_i}{pop_i} &= \beta_0 + s(temp_i, humid_i) + \beta_1 SO2^{log}_i + s(TSP_i) + \epsilon_i \\ \epsilon_i &\sim \mathcal{N}(0,\sigma^2) \end{align} \] where:

  • \(i\) represents a specific day;
  • \(Y_i\) is the number of deaths on day \(i\);
  • \(pop_i\) is the population on day \(i\);
  • \(temp_i\) is the average temperature on day \(i\);
  • \(SO2_i\) is the transformed sulphur dioxide level in ambient air on day \(i\);
  • \(TSP_i\) is the total suspended particles level in ambient air on day \(i\).

The selection criterion was the AIC. To speed up the process, I used classical GAMs rather than Bayesian models.
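A minimal sketch of this AIC-based selection step on simulated data; the data-generating process below is invented, and only the variable names mirror the report:

```r
library(mgcv)  # classical (frequentist) GAMs

# Invented data-generating process; variable names mirror the report.
set.seed(42)
n <- 500
d <- data.frame(mean.temp = runif(n, -5, 30),
                SO2_log   = rnorm(n, 4, 0.5),
                TSP       = runif(n, 20, 300))
d$tot_mort_prob <- 15 + 0.005 * (d$mean.temp - 12)^2 +
                   1.5 * d$SO2_log + rnorm(n, sd = 4)

# Two candidate models differing by one smooth term.
m1 <- gam(tot_mort_prob ~ s(mean.temp) + SO2_log, data = d)
m2 <- gam(tot_mort_prob ~ s(mean.temp) + SO2_log + s(TSP), data = d)

AIC(m1, m2)  # keep the model with the lowest AIC
```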

The estimates are shown in the following chunk:

fit_gam_noday <- gam(tot_mort_prob ~ s(mean.temp, rel.humid) + SO2_log + s(TSP),
                       family = gaussian,
                       data = mort)

summary(fit_gam_noday)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## tot_mort_prob ~ s(mean.temp, rel.humid) + SO2_log + s(TSP)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.9992     0.8702  16.087  < 2e-16 ***
## SO2_log       1.4514     0.1844   7.872 4.56e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                           edf Ref.df      F  p-value    
## s(mean.temp,rel.humid) 25.792 28.383 10.275  < 2e-16 ***
## s(TSP)                  3.588  4.511  5.981 4.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.259   Deviance explained = 26.5%
## GCV = 17.714  Scale est. = 17.561    n = 3652
par(mar = c(4,4,4,3))
layout(matrix(c(1,1,2,3), ncol = 2))
plot(fit_gam_noday,
     shade = TRUE, scheme = 2,
     residuals = TRUE, all.terms = TRUE)

As we previously observed, we cannot tell whether the observed effect of the explanatory variables on mortality is due to a cause-effect relationship or to a spurious correlation.

Considering day

Model for mortality rate

The best predictive model for mortality rate I found is the following:

\[ \begin{align} \frac{Y_i}{pop_i} &= \beta_0 + s(temp_i) + s(doy_i) + \beta_1 year_{1i} + \dots + \beta_J year_{Ji} + \beta_{J+1} weekend_i + \epsilon_i \\ \epsilon_i &\sim \mathcal{N}(0,\sigma^2) \end{align} \] where:

  • \(i\) represents a specific day;
  • \(Y_i\) is the number of deaths on day \(i\);
  • \(pop_i\) is the population on day \(i\);
  • \(temp_i\) is the average temperature on day \(i\);
  • \(doy_i\) is the day of the year of day \(i\) (a value between 1, corresponding to 1st January, and 365, corresponding to 31st December);
  • \(year_{1i}, \dots, year_{Ji}\) are dummy variables representing the year (in our data the years are \(1980, 1981, \dots, 1989\));
  • \(weekend_i\) is a variable that indicates whether \(i\) is a weekend day (Saturday or Sunday).

To estimate the parameters I used a Bayesian approach with a non-informative flat prior \(\pi(\beta)\propto k\), \(k\in(0,+\infty)\).

To perform the variable selection I tried several GAMs estimated with a classical approach, because a Bayesian estimation would have taken too much time and because, with non-informative priors, the estimated values are quite similar.

The estimates are shown in the following chunk:

SEED = 1234
CHAINS = 4
CORES = 4
ITER = 500

fit_gam_bayes_best <- stan_gamm4(tot_mort_prob ~ s(mean.temp) + s(day.of.year) + factor(year) + factor(weekend),
                                 family = gaussian, data = mort,
                                 chains = CHAINS, iter = ITER, seed = SEED, cores = CORES)

coeff 2.5% mean 97.5% signif
(Intercept) 21.34 21.74 22.15 .***
factor(year)1 -0.91 -0.34 0.19
factor(year)2 -1.62 -1.08 -0.52 .***
factor(year)3 -0.30 0.27 0.86
factor(year)4 -1.65 -1.11 -0.56 .***
factor(year)5 -1.22 -0.67 -0.11 .*
factor(year)6 -0.96 -0.39 0.20
factor(year)7 -2.07 -1.54 -1.03 .***
factor(year)8 -2.13 -1.57 -0.99 .***
factor(year)9 -1.87 -1.29 -0.72 .***
factor(year)10 -5.33 -0.70 3.94
factor(weekend)TRUE -0.77 -0.49 -0.19 .***
s(mean.temp).1 -25.18 -6.38 11.47
s(mean.temp).2 -6.78 6.63 20.97
s(mean.temp).3 2.06 16.64 31.80 .*
s(mean.temp).4 -18.00 -7.38 2.39
s(mean.temp).5 3.57 14.12 24.25 .*
s(mean.temp).6 -15.52 -10.49 -5.51 .***
s(mean.temp).7 13.80 21.78 29.92 .***
s(mean.temp).8 -39.20 -26.88 -15.69 .***
s(mean.temp).9 26.68 59.51 91.63 .**
s(day.of.year).1 -22.58 -1.95 18.33
s(day.of.year).2 -21.22 -7.96 5.70
s(day.of.year).3 -69.97 -54.99 -40.88 .***
s(day.of.year).4 2.11 11.04 19.49 .*
s(day.of.year).5 21.54 29.20 37.19 .***
s(day.of.year).6 14.49 19.34 23.80 .***
s(day.of.year).7 8.34 12.00 16.22 .***
s(day.of.year).8 -9.02 -2.74 3.59
s(day.of.year).9 0.01 16.80 37.37 .*
sigma 3.90 3.99 4.07 .***
smooth_sd[s(mean.temp)1] 8.50 14.97 24.23 .***
smooth_sd[s(mean.temp)2] 11.37 24.99 42.00 .***
smooth_sd[s(day.of.year)1] 14.52 20.66 29.61 .***
smooth_sd[s(day.of.year)2] 1.34 11.23 25.36 .***

The following plot shows the marginal effects of day.of.year and mean.temp.

Considering interaction

I also investigated whether it is better to include the interaction between day.of.year and mean.temp in the spline component of the model (i.e. s(mean.temp) + s(day.of.year) versus s(mean.temp, day.of.year)).

fit_gam_bayes_int <- stan_gamm4(tot_mort_prob ~ s(day.of.year, mean.temp) + factor(year) + factor(weekend),
                                family = gaussian, data = mort,
                                chains = CHAINS, iter = ITER, seed = SEED, cores = CORES)

As we can see by plotting the joint effect of mean.temp and day.of.year for the two models, they are quite similar.

To choose between the two, I used the Leave-One-Out Information Criterion (LOO-IC). According to it, the model without the interaction performs better.

loo_best <- loo(fit_gam_bayes_best)
loo_int <- loo(fit_gam_bayes_int)

loo_best
## 
## Computed from 1000 by 3652 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo -10241.1 46.7
## p_loo        32.4  1.7
## looic     20482.2 93.4
## ------
## Monte Carlo SE of elpd_loo is 0.2.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     3650  99.9%   370       
##  (0.5, 0.7]   (ok)          2   0.1%   313       
##    (0.7, 1]   (bad)         0   0.0%   <NA>      
##    (1, Inf)   (very bad)    0   0.0%   <NA>      
## 
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
loo_int
## 
## Computed from 1000 by 3652 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo -10266.5 49.2
## p_loo        39.1  1.3
## looic     20532.9 98.3
## ------
## Monte Carlo SE of elpd_loo is 0.2.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     3650  99.9%   439       
##  (0.5, 0.7]   (ok)          2   0.1%   510       
##    (0.7, 1]   (bad)         0   0.0%   <NA>      
##    (1, Inf)   (very bad)    0   0.0%   <NA>      
## 
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
compare(loo_best,
        loo_int)
## elpd_diff        se 
##     -25.4      12.6

Model checking

The following plots check whether the model assumptions hold in the data.

From the overlay plot and the statistics computed on the whole dataset, the model looks fine.

Considering the statistics computed for each year, something strange appears. Looking at the plots for the standard deviation, we can see that in 1983 and 1986 the standard deviation in the data is much higher than expected, while in other years, like 1982, 1987 and 1988, it is lower than expected. That means that the model does not properly describe the within-year variance, so the homoscedasticity assumption does not hold.

From the following plot we can see that in the years 1983 and 1986 there are some outliers with a particularly high mortality. In 1983 this phenomenon happened during summer, while in 1986 it happened during the first quarter.

Model for respiratory mortality rate

To model the number of deaths from respiratory causes, I used a Poisson distribution.

The best predictive model for respiratory mortality rate I found is the following:

\[ \begin{align} Y^{resp}_i &\sim \text{Poisson}(\lambda_i) \\ \frac{\lambda_i}{pop_i} &= e^{\beta_0 + s(temp_i) + s(doy_i) + \beta_1 year_{1i} + \dots + \beta_J year_{Ji} + \beta_{J+1} SO2^{log}_i} \end{align} \] where:

  • \(i\) represents a specific day;
  • \(Y^{resp}_i\) is the number of deaths from respiratory causes on day \(i\);
  • \(pop_i\) is the population on day \(i\);
  • \(temp_i\) is the average temperature on day \(i\);
  • \(doy_i\) is the day of the year of day \(i\) (a value between 1, corresponding to 1st January, and 365, corresponding to 31st December);
  • \(year_{1i}, \dots, year_{Ji}\) are dummy variables representing the year (in our data the years are \(1980, 1981, \dots, 1989\));
  • \(SO2_i\) is the transformed sulphur dioxide level in ambient air on day \(i\).

The estimates are shown in the following chunk:

fit_resp_gam_bayes_best <- stan_gamm4(resp.mort ~ offset(log(pop_m)) + s(mean.temp) + SO2_log + s(day.of.year) + factor(year),
                                      family = poisson, data = mort,
                                      chains = CHAINS, iter = ITER, seed = SEED, cores = CORES)

coeff 2.5% mean 97.5% signif
(Intercept) -0.34 0.03 0.40
SO2_log 0.03 0.10 0.17 .***
factor(year)1 -0.18 -0.10 -0.01 .*
factor(year)2 -0.33 -0.24 -0.15 .***
factor(year)3 -0.17 -0.09 0.00 .
factor(year)4 -0.39 -0.30 -0.20 .***
factor(year)5 -0.26 -0.17 -0.08 .***
factor(year)6 -0.27 -0.17 -0.07 .**
factor(year)7 -0.46 -0.36 -0.25 .***
factor(year)8 -0.40 -0.28 -0.17 .***
factor(year)9 -0.43 -0.32 -0.21 .***
factor(year)10 -0.88 -0.12 0.53
s(mean.temp).1 -1.64 0.43 2.34
s(mean.temp).2 -2.02 -0.30 1.41
s(mean.temp).3 -2.27 -0.78 0.82
s(mean.temp).4 -1.72 -0.23 0.90
s(mean.temp).5 -0.60 0.37 1.58
s(mean.temp).6 -1.54 -0.85 -0.12 .*
s(mean.temp).7 0.26 0.87 1.79 .**
s(mean.temp).8 -3.99 -2.41 -1.04 .***
s(mean.temp).9 -0.44 1.45 4.77
s(day.of.year).1 -3.73 -1.01 1.28
s(day.of.year).2 -2.70 -0.80 1.03
s(day.of.year).3 -3.97 -1.76 0.29 .
s(day.of.year).4 -1.68 -0.31 1.03
s(day.of.year).5 -0.64 0.65 2.00
s(day.of.year).6 1.06 1.98 2.85 .***
s(day.of.year).7 1.02 1.60 2.17 .***
s(day.of.year).8 -1.56 -0.58 0.30
s(day.of.year).9 0.61 3.03 5.43 .*
smooth_sd[s(mean.temp)1] 0.64 1.31 2.36 .***
smooth_sd[s(mean.temp)2] 0.06 1.52 4.36 .***
smooth_sd[s(day.of.year)1] 0.79 1.52 2.70 .***
smooth_sd[s(day.of.year)2] 0.68 2.41 5.42 .***

The following plot shows the marginal effects of day.of.year and mean.temp.

Model checking

The following plots check whether the model assumptions hold in the data.

As for the previous model, from the overlay plot and the statistics computed on the whole dataset, the model looks fine.

As in the previous model, the standard deviation of the response variable computed in the dataset for the year 1983 and 1986 is higher than expected.

As we saw for mortality in general, we can see that in the years 1983 and 1986 there are some outliers with a particularly high respiratory mortality. In 1983 this phenomenon happened during summer, while in 1986 it happened during the first quarter.

Model interpretations

The variables that have a significant effect on the response are:

  • for mortality in general:
    • mean.temp
    • day.of.year
    • year
    • weekend
  • for respiratory mortality:
    • mean.temp
    • day.of.year
    • year
    • SO2

That means that, even with time included as an explanatory variable, mean.temp has a significant effect. Looking at the marginal effect plots for both models (in the previous chapter), we can see that mortality is higher when it is very cold and when it is very hot. On the contrary, rel.humid does not have a significant effect on mortality.

As for the pollution variables (SO2 and TSP), TSP does not have a significant effect on mortality, while SO2 has a significant effect if we consider only deaths from respiratory causes. Furthermore, the SO2 coefficient is positive, so the hypothesis that the concentration of sulphur dioxide has an impact on respiratory deaths is supported by the data. However, since respiratory deaths are only a small fraction of all deaths, the effect is not significant for overall mortality.
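Since the Poisson model uses a log link, the SO2_log coefficient acts multiplicatively on the rate. Taking the posterior mean and 95% interval from the summary table of the respiratory model, the implied rate ratio can be computed as:

```r
# Posterior mean and 95% interval of the SO2_log coefficient, taken from the
# summary table of the respiratory model above.
beta_so2 <- 0.10
ci_so2   <- c(0.03, 0.17)

rate_ratio <- exp(beta_so2)
rate_ratio               # about 1.105: ~ +10.5% respiratory deaths per unit of SO2_log
100 * (exp(ci_so2) - 1)  # interval: roughly +3% to +18.5%
```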

The weekend has a significant effect on overall mortality, while it does not affect respiratory mortality. The coefficient of weekend is negative, so mortality is lower on weekends. This is probably because people do not work on weekends and there is less traffic, so there are fewer accidents at work and on the road, and people are more relaxed and less stressed.

Possible improvements

A great improvement for the model could be obtained by using new variables such as the cause of death and the age at death. With the cause of death we could build a model for each cause and then combine them into a more accurate ensemble. With the age at death we could build models for age-specific mortality rates and take the age distribution into account.

To improve the estimates without getting more data, we could model the effect of day.of.year not just as a spline on the values between 1 and 365, but as a periodic function with period 365. That would yield a function with \(s(0) = s(365)\), which is more reasonable for our problem.
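A sketch of this idea using mgcv's cyclic cubic spline basis (bs = "cc"), on simulated data; the knot range below is an assumption placing the period boundary between 31 December and 1 January:

```r
library(mgcv)

# Simulated seasonal series: three years of a noisy annual cycle.
set.seed(7)
doy <- rep(1:365, times = 3)
y   <- 20 + 3 * cos(2 * pi * doy / 365) + rnorm(length(doy), sd = 1)
d   <- data.frame(doy = doy, y = y)

# bs = "cc" is a cyclic cubic spline: the fitted function (and its
# derivatives) match at the two ends of the range, so the seasonal effect
# joins smoothly across the new year.
fit_cc <- gam(y ~ s(doy, bs = "cc"), data = d,
              knots = list(doy = c(0.5, 365.5)))  # assumed period boundary

# Fitted seasonal effect at the two ends of the year: nearly identical.
f <- predict(fit_cc, newdata = data.frame(doy = c(1, 365)))
```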

To deal with the outliers we could also consider other kinds of models, such as the quasi-Poisson and the negative binomial. I tried the negative binomial, but I ran into problems with the non-convergence of the Markov chains.

Another idea would be to model the effect of the year not just as an additive factor, but as another spline on the day number (which goes from \(1\) to \(10\times 365\)). The effect would then be s(day.of.year) + s(day.num). This approach would fit the data better, but it would reduce the interpretability of the model.

Otherwise, keeping the additive year effect, we could build a hierarchical model in which year constitutes the lower level and day.of.year the higher level. This model would be useful for making predictions on the future, because year would no longer be an explanatory variable with realizations only in \(1, \dots, J\).

Finally, a different approach that could work well with these data is an autoregressive model, which could deal with the seasonality by adding correlation between the mortality on each day and the mortality 365 days earlier.
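A minimal sketch of the idea on simulated data: regressing each day's value on the value observed 365 days earlier. A full treatment would use a proper seasonal autoregressive model instead.

```r
# Simulated daily series with an annual cycle (stand-in for the mortality rate).
set.seed(3)
n <- 365 * 4
y <- 20 + 3 * cos(2 * pi * (1:n) / 365) + rnorm(n)

# Crude seasonal-autoregressive term: each day regressed on the value
# observed 365 days earlier (first year dropped via NA).
lag365 <- c(rep(NA, 365), y[1:(n - 365)])
fit_ar <- lm(y ~ lag365)
coef(fit_ar)["lag365"]  # positive: days one year apart are positively correlated
```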